{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SUknhuHKyc-E"
   },
   "source": [
    "<p style=\"text-align:center\">\n",
    "    <img alt=\"Ragas\" src=\"https://github.com/explodinggradients/ragas/blob/main/docs/_static/imgs/logo.png?raw=true\" width=\"250\">\n",
    "<br>\n",
    "<p>\n",
    "  <a href=\"https://github.com/explodinggradients/ragas\">Ragas</a>\n",
    "  |\n",
    "  <a href=\"https://docs.arize.com/arize/\">Arize</a>\n",
    "  |\n",
    "  <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg\">Slack Community</a>\n",
    "</p>\n",
    "\n",
    "<center><h1>Tracing and Evaluating AI agents</h1></center>\n",
    "\n",
    "This guide will walk you through the process of creating and evaluating agents using Ragas and Arize Phoenix. We'll cover the following steps:\n",
    "\n",
    "* Build an agent that solves math problems with the OpenAI Agents SDK\n",
    "\n",
    "* Trace agent activity to monitor interactions\n",
    "\n",
    "* Generate a benchmark dataset for performance analysis\n",
    "\n",
    "* Evaluate agent performance using Ragas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "baTNFxbwX1e2"
   },
   "source": [
    "# Initial setup\n",
    "\n",
    "We'll set up our libraries, API keys, and OpenAI tracing using Phoenix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "n69HR7eJswNt"
   },
   "source": [
    "### Install Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q \"arize-phoenix>=8.0.0\" arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai arize-phoenix-evals \"arize[Datasets]\" --upgrade\n",
    "\n",
    "!pip install langchain-openai\n",
    "\n",
    "!pip install -q openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio ragas openai-agents --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jQnyEnJisyn3"
   },
   "source": [
    "### Setup Keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cmuChHhPd-Z7"
   },
   "source": [
    "Next, connect to Phoenix. The code below targets a Phoenix Cloud instance and prompts for your Phoenix and OpenAI API keys if they are not already set. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
    "if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
    "    os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
    "\n",
    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kfid5cE99yN5"
   },
   "source": [
    "### Setup Tracing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.otel import register\n",
    "\n",
    "# Set up OpenTelemetry: register a tracer provider that exports spans to Phoenix.\n",
    "# auto_instrument=True turns on the OpenInference instrumentors installed above.\n",
    "tracer_provider = register(\n",
    "    project_name=\"agents-cookbook\",\n",
    "    endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
    "    auto_instrument=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bLVAqLi5_KAi"
   },
   "source": [
    "# Create your first agent with the OpenAI Agents SDK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2RpnS8n8AirE"
   },
   "source": [
    "Here we set up a basic agent that can solve math problems.\n",
    "\n",
    "We define a function tool that evaluates math expressions, and an agent that can call this tool.\n",
    "\n",
    "We'll use the `Runner` class to run the agent and get the final output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agents import Runner, function_tool\n",
    "\n",
    "\n",
    "@function_tool\n",
    "def solve_equation(equation: str) -> str:\n",
    "    \"\"\"Use python to evaluate the math equation, instead of thinking about it yourself.\n",
    "\n",
    "    Args:\n",
    "        equation: string to pass into eval() in python\n",
    "    \"\"\"\n",
    "    # eval() runs arbitrary Python: fine for this demo, unsafe for untrusted input.\n",
    "    return str(eval(equation))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agents import Agent\n",
    "\n",
    "agent = Agent(\n",
    "    name=\"Math Solver\",\n",
    "    instructions=\"You solve math problems by evaluating them with python and returning the result\",\n",
    "    tools=[solve_equation],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = await Runner.run(agent, \"what is 15 + 28?\")\n",
    "\n",
    "# Run Result object\n",
    "print(result)\n",
    "\n",
    "# Get the final output\n",
    "print(result.final_output)\n",
    "\n",
    "# Get the entire list of messages recorded to generate the final output\n",
    "print(result.to_input_list())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sq4rcseCGKRc"
   },
   "source": [
    "Now that we have a basic agent, let's evaluate whether it responds correctly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y23F9ZP5AirG"
   },
   "source": [
    "# Evaluating our agent\n",
    "\n",
    "Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas metrics help with this:\n",
    "1. Tool call accuracy - did our agent choose the right tool with the right arguments?\n",
    "2. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dr4C55YBAirG"
   },
   "source": [
    "Let's set up our evaluation by defining our task function, our evaluators, and our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "from agents import Runner\n",
    "\n",
    "\n",
    "# Task function: takes a question and returns the agent's final output\n",
    "# together with the messages recorded while producing it.\n",
    "async def solve_math_problem(input):\n",
    "    # Accept either a plain string or a dict such as {\"question\": ...}\n",
    "    if isinstance(input, dict):\n",
    "        input = next(iter(input.values()))\n",
    "    result = await Runner.run(agent, input)\n",
    "    return {\"final_output\": result.final_output, \"messages\": result.to_input_list()}\n",
    "\n",
    "\n",
    "result = asyncio.run(solve_math_problem(\"What is 15 + 28?\"))\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9SHK2ZhxAirH"
   },
   "source": [
    "The helper below converts the agent's recorded messages into a Ragas `MultiTurnSample`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def conversation_to_ragas_sample(messages, reference_equation=None, reference_answer=None):\n",
    "    \"\"\"\n",
    "    Convert a recorded agent conversation into a Ragas MultiTurnSample.\n",
    "\n",
    "    Args:\n",
    "        messages: List of message dicts, as returned by `result.to_input_list()`\n",
    "        reference_equation: Optional string; when set, reference tool calls are attached\n",
    "        reference_answer: Optional string with the reference final answer\n",
    "\n",
    "    Returns:\n",
    "        MultiTurnSample: Formatted sample for Ragas evaluation\n",
    "    \"\"\"\n",
    "\n",
    "    import json\n",
    "\n",
    "    from ragas.dataset_schema import MultiTurnSample\n",
    "    from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage\n",
    "\n",
    "    ragas_messages = []\n",
    "    pending_tool_call = None\n",
    "    reference_tool_calls = None\n",
    "\n",
    "    for item in messages:\n",
    "        role = item.get(\"role\")\n",
    "        item_type = item.get(\"type\")\n",
    "\n",
    "        if role == \"user\":\n",
    "            ragas_messages.append(HumanMessage(content=item[\"content\"]))\n",
    "            pending_tool_call = None\n",
    "\n",
    "        elif item_type == \"function_call\":\n",
    "            args = json.loads(item[\"arguments\"])\n",
    "            pending_tool_call = ToolCall(name=item[\"name\"], args=args)\n",
    "\n",
    "        elif item_type == \"function_call_output\":\n",
    "            if pending_tool_call is not None:\n",
    "                ragas_messages.append(AIMessage(content=\"\", tool_calls=[pending_tool_call]))\n",
    "                ragas_messages.append(\n",
    "                    ToolMessage(\n",
    "                        content=item[\"output\"],\n",
    "                        name=pending_tool_call.name,\n",
    "                        tool_call_id=\"tool_call_1\",\n",
    "                    )\n",
    "                )\n",
    "                pending_tool_call = None\n",
    "            else:\n",
    "                print(\"[WARN] ToolMessage without preceding ToolCall - skipping.\")\n",
    "\n",
    "        elif role == \"assistant\":\n",
    "            content = (\n",
    "                item[\"content\"][0][\"text\"]\n",
    "                if isinstance(item.get(\"content\"), list)\n",
    "                else item.get(\"content\", \"\")\n",
    "            )\n",
    "            ragas_messages.append(AIMessage(content=content))\n",
    "    print(\"Ragas_messages\", ragas_messages)\n",
    "\n",
    "    if reference_equation:\n",
    "        # NOTE: reference_equation only selects this branch. The reference tool call is\n",
    "        # copied from the agent's own first function call, so the resulting score is a\n",
    "        # self-consistency check, not a comparison against independent ground truth.\n",
    "        for item in messages:\n",
    "            if item.get(\"type\") == \"function_call\":\n",
    "                args = json.loads(item[\"arguments\"])\n",
    "                reference_tool_calls = [ToolCall(name=item[\"name\"], args=args)]\n",
    "                break\n",
    "\n",
    "        return MultiTurnSample(user_input=ragas_messages, reference_tool_calls=reference_tool_calls)\n",
    "\n",
    "    elif reference_answer:\n",
    "        return MultiTurnSample(user_input=ragas_messages, reference=reference_answer)\n",
    "\n",
    "    else:\n",
    "        return MultiTurnSample(user_input=ragas_messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here is an example of the function in action\n",
    "sample = conversation_to_ragas_sample(\n",
    "    # Messages recorded for \"What is 15 + 28?\"\n",
    "    result[\"messages\"],\n",
    "    # When both references are given, reference_equation takes precedence\n",
    "    reference_equation=\"15 + 28\",\n",
    "    reference_answer=\"43\",\n",
    ")\n",
    "print(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o85Vvi_tAirI"
   },
   "source": [
    "Now let's set up our evaluators. We'll import both metrics from Ragas and call each metric's `multi_turn_ascore(sample)` method to get a score.\n",
    "\n",
    "The `AgentGoalAccuracyWithReference` metric compares the final output to the reference to see if the goal was accomplished.\n",
    "\n",
    "The `ToolCallAccuracy` metric compares the tool calls the agent made to the reference tool calls to see if the tool was called correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from ragas.llms import LangchainLLMWrapper\n",
    "from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy\n",
    "\n",
    "\n",
    "# Evaluators: each builds a Ragas sample from the task output and scores it\n",
    "async def tool_call_evaluator(input, output):\n",
    "    sample = conversation_to_ragas_sample(output[\"messages\"], reference_equation=input[\"question\"])\n",
    "    tool_call_accuracy = ToolCallAccuracy()\n",
    "    return await tool_call_accuracy.multi_turn_ascore(sample)\n",
    "\n",
    "\n",
    "async def goal_evaluator(input, output):\n",
    "    # NOTE: the reference here is the agent's own final output;\n",
    "    # substitute a ground-truth answer when one is available.\n",
    "    sample = conversation_to_ragas_sample(\n",
    "        output[\"messages\"], reference_answer=output[\"final_output\"]\n",
    "    )\n",
    "    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\n",
    "    goal_accuracy = AgentGoalAccuracyWithReference(llm=evaluator_llm)\n",
    "    return await goal_accuracy.multi_turn_ascore(sample)\n",
    "\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "k0Qvn8tAs9vL"
   },
   "source": [
    "# Create synthetic dataset of questions\n",
    "\n",
    "Using the template below, we'll generate a DataFrame of 10 questions to test our math-solving agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MATH_GEN_TEMPLATE = \"\"\"\n",
    "You are an assistant that generates diverse math problems for testing a math solver agent.\n",
    "The problems should include:\n",
    "\n",
    "Basic Operations: Simple addition, subtraction, multiplication, division problems.\n",
    "Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.\n",
    "Exponents and Roots: Problems involving powers, square roots, and other nth roots.\n",
    "Percentages: Problems involving calculating percentages of numbers or finding percentage changes.\n",
    "Fractions: Problems with addition, subtraction, multiplication, or division of fractions.\n",
    "Algebra: Simple algebraic expressions that can be evaluated with specific values.\n",
    "Sequences: Finding sums, products, or averages of number sequences.\n",
    "Word Problems: Converting word problems into mathematical equations.\n",
    "\n",
    "Do not include any solutions in your generated problems.\n",
    "\n",
    "Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.\n",
    "Generate 10 diverse math problems. Ensure there are no duplicate problems.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "import pandas as pd\n",
    "\n",
    "from phoenix.evals import OpenAIModel\n",
    "\n",
    "nest_asyncio.apply()\n",
    "pd.set_option(\"display.max_colwidth\", 500)\n",
    "\n",
    "# Initialize the model\n",
    "model = OpenAIModel(model=\"gpt-4o\", max_tokens=1300)\n",
    "\n",
    "# Generate math problems\n",
    "resp = model(MATH_GEN_TEMPLATE)\n",
    "\n",
    "# Create DataFrame\n",
    "split_response = resp.strip().split(\"\\n\")\n",
    "math_problems_df = pd.DataFrame(split_response, columns=[\"question\"])\n",
    "math_problems_df = math_problems_df[math_problems_df[\"question\"].str.strip() != \"\"]\n",
    "\n",
    "print(math_problems_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oGIbV49kHp4H"
   },
   "source": [
    "Now let's run the agent on every question in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "from agents import Runner\n",
    "\n",
    "# Run agent on all problems and collect conversation history\n",
    "conversations = []\n",
    "for idx, row in math_problems_df.iterrows():\n",
    "    result = asyncio.run(solve_math_problem(row[\"question\"]))\n",
    "    conversations.append(\n",
    "        {\n",
    "            \"question\": row[\"question\"],\n",
    "            \"final_output\": result[\"final_output\"],\n",
    "            \"messages\": result[\"messages\"],\n",
    "        }\n",
    "    )\n",
    "\n",
    "print(f\"Processed {len(conversations)} problems\")\n",
    "print(conversations[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cHYgS5cpRE3b"
   },
   "source": [
    "# Create an experiment\n",
    "\n",
    "With the dataset of questions generated above, we can use Phoenix experiments to track changes across models, prompts, and parameters for our agent."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SgTEu7U4Rd5i"
   },
   "source": [
    "Let's create this dataset and upload it to Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import phoenix as px\n",
    "\n",
    "dataset_df = pd.DataFrame(\n",
    "    {\n",
    "        \"question\": [conv[\"question\"] for conv in conversations],\n",
    "        \"final_output\": [conv[\"final_output\"] for conv in conversations],\n",
    "    }\n",
    ")\n",
    "\n",
    "dataset = px.Client().upload_dataset(\n",
    "    dataframe=dataset_df,\n",
    "    dataset_name=\"math-questions\",\n",
    "    input_keys=[\"question\"],\n",
    "    output_keys=[\"final_output\"],\n",
    ")\n",
    "\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ykdwc-XSgMJx"
   },
   "source": [
    "Finally, we run our experiment and view the results in Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.experiments import run_experiment\n",
    "\n",
    "experiment = run_experiment(\n",
    "    dataset, task=solve_math_problem, evaluators=[goal_evaluator, tool_call_evaluator]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T088EMx4gXKS"
   },
   "source": [
    "![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/ragas_results.gif)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
